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Abstract. When a matrix A with n columns is known to be well approximated by a linear combination of 
basis matrices Bi, . . . , Bp, we can apply yl to a random vector and solve a linear system to recover 
this linear combination. The same technique can be used to obtain an approximation to A~^. A 
basic question is whether this linear system is well-conditioned. This is important for two reasons: 
a well-conditioned system means (1) we can invert it and (2) the error in the reconstruction can 
be controlled. In this paper, we show that if the Gram matrix of the Bj's is sufficiently well- 
conditioned and each Bj has a high numerical rank, then n oc plog^ n will ensure that the linear 
system is well-conditioned with high probability. Our main application is probing linear operators 
with smooth pseudodifferential symbols such as the wave equation Hessian in seismic imaging [9]. 
We also demonstrate numerically that matrix probing can produce good preconditioners for inverting 
elliptic operators in variable media. 
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1. Introduction. The earliest randomized algorithms include Monte Carlo integration and 
Monte Carlo Markov chains [1]. These are standard techniques in numerical computing with 
widespread applications from physics, econometrics to machine learning. However, they are 
often seen as the methods of last resort, because they are easy to implement but produce 
solutions of uncertain accuracy. 

In the last few decades, a new breed of randomized algorithms has been developed by the 
computer science community. These algorithms remain easy to implement, and in addition, 
have failure probabilities that are provably negligible. In other words, we have rigorous theory 
to ensure that these algorithms perform consistently well. Moreover, their time complexity 
can be as good as the most sophisticated deterministic algorithms, e.g., Karp- Rabin's pattern 
matching algorithm [17] and Karger's min-cut algorithm [16]. 

In recent years, equally attractive randomized algorithms are being developed in the nu- 
merical community. For example, in compressed sensing [4], we can recover sparse vectors with 
random measurement matrices and £^ minimization. Another interesting example is that we 
can build a good low rank approximation of a matrix by applying it to random vectors [14]. 

Our work carries a similar flavor: often, the matrix A can be approximated as a linear 
combination of a small number of matrices and the idea is to obtain these coefficients by 
applying A to a random vector or just a few of them. We call this "forward matrix probing." 
What is even more interesting is that we can also probe for by applying A to a random 
vector. We call this "backward matrix probing" for a reason that will be clear in Section 1.5. 

Due to approximation errors, the output of "backward probing" denoted as C, is only 
an approximate inverse. Nevertheless, as we will see in Section 4, C serves very well as 
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a preconditioner for inverting A, and we believe that its performance could match that of 
multigrid methods for elliptic operators in smooth media. 

We like to add that the idea of "matrix probing" is not new. For example, Chan [6, 5] et. 
al. use the technique to approximate A with a sparse matrix. Another example is the work 
by Pfander et. al. [21] where the same idea is used in a way typical in compressed sensing. 
In the next section, we will see that their set-up is fundamentally different from ours. 

1.1. Forward matrix probing. Let B = {Bi, . . . ,Bp} where each Bj G £i^>^n called a 
basis matrix. Note that B is specified in advance. Let u be a Gaussian or a Rademacher 
sequence, that is each component of u is independent and is either a standard normal variable 
or ±1 with equal probability. 

Define the matrix L G (C™^p such that its j-th column is Bju. Let A G C"*^"- be the 
matrix we want to probe and suppose A lies in the span of B. Say 

p 

A = CiBi for some ci, . . . , Cp G C. 

i=l 

Observe that Au = Yl^=i Ci{Biu) = Lc. Given the vector Au, we can obtain the coefficient 
vector c = (ci, . . . , Cp)^ by solving the linear system 

Lc = Au. (1.1) 

In practice, A is not exactly in the span of a small B and Equation (1.1) has to be solved 
in a least squares sense, that is c = L~^{Au) where is the pseudoinverse of L. 

We will assume that p < n. Otherwise there are more unknowns than equations and there 
is no unique solution if there is any. This differs from the set-up in [21] where n ^ p but A 
is assumed to be a sparse linear combination of . . . , Bp. 

1.2. Conditioning of L. Whether Equation (1.1) can be solved accurately depends on 
cond(L), the condition number of L. This is the ratio between the largest and the smallest 
singular values of L and can be understood as how different L can stretch or shrink a vector. 

Intuitively, whether cond(L) is small depends on the following two properties of B. 

1. The Sj's "act differently" in the sense that {Bj, Bk) — 5jk for any 1 < j. A; < p} 

2. Each Bi has a high rank so that Biu, . . . , BpU G C" exist in a high dimensional space. 
When B possesses these two properties and p is sufficiently small compared to n, it makes 

sense that L's columns, Biu, . . . , BpU, are likely to be independent, thus guaranteeing that L 
is invertible, at least. 

We now make the above two properties more precise. Let 

M = L*L G CP'^P and N = KM. (1.2) 

Clearly, cond(M) = cond(L)^. If EM is ill-conditioned, there is little chance that M or L 
is well-conditioned. This can be related to Property 1 by observing that 

Njk = EMjk = Tr{B,*B,) = {Bj,Bk) . (1.3) 



'^Note that (•, •) is the Frobenius inner product and Sjk is the Kronecker delta. 
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If {Bj,Bk) ~ 5jk, then the Gram matrix N is approximately the identity matrix which 
is well-conditioned. Hence, a more quantitative way of putting Property 1 is that we have 
control over ii{B) defined as follows. 

Definition 1.1. Let B = . . . , he a set of matrices. Define its condition number 
n{B) as the condition number of the matrix N G C^^p where Njj. = {Bj^B^)- 

On the other hand, Property 2 can be made precise by saying that we have control over 
\{B) as defined below. 

Definition 1.2. Let A € C™^". Define its weak condition number^ as 

W^Wf 

Let B be a set of matrices. Define its (uniform) weak condition number as 

X(B) =maxA(^). 



We justify the nomenclature as follows. Suppose A G ([^nxn j^g^g condition number k, 
then ll^ll^ = Y17=i'^i — ^'^m'm — '>T'\\A\\'^ /k'^ . Taking square root, we obtain X{A) < k. In 
other words, any well-conditioned matrix is also weakly well-conditioned. And like the usual 
condition number, A(^) > 1 because we always have ||^||p < n^^'^ ll^ll- 

The numerical rank of a matrix A is ||j4||^ / ||^||^ = nX{A)~^, thus having a small X{A) 
is the same as having a high numerical rank. We also want to caution the reader that X{B) is 
defined very differently from k,{B) and is not a weaker version of k{B). 

Using classical concentration inequalties, it was shown [9] that when X{B) and k{B) are 
fixed, p = 0(ni/2)3 will ensure that L is well-conditioned with high probability. 

In this paper, we establish a stronger result, namely that p = 0{n) suffices. The implica- 
tion is that we can expect to recover 0{n) instead of 0(n^/^) coefficients. The exact statement 
is presented below. 

Theorem 1.3 (Main result). Let Ci,C2 > be numbers given by Remark B.l in the Ap- 
pendix. Let B = {Bi, . . . , Bp] where each Bj € C"^". Define L G C"^^ such that its j-th 
column is Bju where u is either a Gaussian or Rademacher sequence. Let M = L*L, N = EM 
K = n{B) and X = X{B). Suppose 

n > p (CKXlogn)'^ for some C > 1. 

Then 

P ( IIM - N\\ > MM ) < 2C2pn^~" where a = 
\ 1^ J eCi 

The number Ci is small. C2 may be large but it poses no problem because n~°' decays 
very fast with larger n and C. With t = 1/2, we deduce that with high probability, 

cond(Af ) < 2k + 1. 

■^Throughout the paper, ||-|| and denote the spectral and Frobenius norms respectively. 
''Note that 0(n) denotes O(nlog'^n) for some c > 0. In other words, ignore log factors. 
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In general, we let < t < 1 and for the probability bound to be useful, we need a > 2, 
which implies C > 2eCi > 1. Therefore the assumption that C > 1 in the theorem can be 
considered redundant. 

We remark that Rauhut and Tropp have a new result (a Bernstein-like tail bound) that 
may be used to refine the theorem. This will be briefly discussed in Section 4.1 where we 
conduct a numerical experiment. 

Note that when n is a Gaussian sequence, M resembles a Wishart matrix for which the 
distribution of the smallest eigenvalue is well-studied [11]. However, each row of L is not 
independent, so results from random matrix theory cannot be used in this way. 

An intermediate result in the proof of Theorem 1.3 is the following. It conveys the essence 
of Theorem 1.3 and may be easier to remember. 

Theorem 1.4. Assume the same set-up as in Theorem 1.3. Suppose n = 0{p). Then 

E||M-iV|| < C{logn) \\N\\ {p/n)^^'^X for some C > 0. 



A numerical experiment in Section 4.1 suggests that the relationship between p and n 
is not tight in the log factor. Our experiment show that for E||M — A'^H/IIA^H to vanish as 
p — )■ CO, n just needs to increase faster than plog(np), whereas Theorem 1.4 requires n to 
grow faster than plog^ n. 

Next, we see that when L is well-conditioned, the error in the reconstruction is also small. 

Proposition 1.5. Assume the same set-up as in Theorem 1.3. Suppose A = X]j=i ^j^j + 
where \\E\\ < e and assume whp, 



|M- A^ll < 



t\\N\ 



for some < t < 1. 



Let c = L~^Au he the recovered coefficients. Then whp, 



<0\eX 



Kp 
1-t 



If e = o{p ^/^), then the proposition guarantees that the overall error goes to zero as 
p — 7- CO. Of course, a larger n and more computational effort are required. 

1.3. Multiple probes. Fix n and suppose p > n. L is not going to be well-conditioned or 
even invertible. One way around this is to probe A with multiple random vectors ui, . . . ,Uq G 
at one go, that is to solve 

L'c = A'u, 

where the j-th column of L' and A'u are respectively 





\ 


/ Am \ 




and 
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For this to make sense, A' = Iq <^ A where Iq is the identity matrix of size q. Also 
define B'j = Jg 05 Bj and treat the above as probing A' assuming that it hes in the span of 
B' = {B[,...,B'p}. 

Regarding the conditioning of L', we can apply Theorem 1.3 to A' and B'. It is an 
easy exercise (cf. Proposition A.l) to see that the condition numbers are unchanged, that 
is k{B) = k{B') and X{B) = X{B'). Applying Theorem 1.3 to A' and B', we deduce that 
cond(L) < 2k + 1 with high probability provided that 

nq oc p{k,\ log n)^. 

Remember that A has only mn degrees of freedom; while we can increase q as much as we 
like to improve the conditioning of L, the problem set-up does not allow p > mn coefficients. 
In general, when A has rank h, its degrees of freedom is n(m + n — n) by considering its SVD. 

1.4. When to probe. Matrix probing is an especially useful technique when the following 
holds. 

1. We know that the probed matrix A can be approximated by a small number of basis 
matrices that are specified in advance. This holds for operators with smooth pseudod- 
ifferential symbols, which will be studied in Section 3. 

2. Each matrix Bi can be applied to a vector in 0(max(m, n)) time using only 0(max(m, n)) 
memory. 

The second condition confers two benefits. First, the coefficients c can be recovered 
fast, assuming that u and Au are already provided. This is because L can be computed in 
0(max(?7i, n)p) time and Equation (1.1) can be solved in 0{mp'^+p^) time by QR factorization 
or other methods. In the case where increasing m, n does not require a bigger B to approximate 
A, p can be treated as a constant and the recovery of c takes only 0(max(m,n)) time. 

Second, given the coefficient vector c, A can be applied to any vector v by summing over 
BiV^s in 0(max(m, n)p) time . This speeds up iterative methods such as GMRES and Arnoldi. 

1.5. Backward matrix probing. A compelling application of matrix probing is computing 
the pseudoinverse A'^ of a matrix A G c^x" when is known to be well-approximated in 
the space of some B = {Bi, . . . ,Bp}. This time, we probe A'^ by applying it to a random 
vector V = Au where n is a Gaussian or Rademacher sequence that we generate. 

Like in Section 1.1, define L G C"^^ such that its j-th column is BjV = BjAu. Suppose 
A~^ = X^f^^ CiBi for some ci, . . . ,Cp G C. Then the coefficient vector c can be obtained by 
solving 

Lc = A^v = A-^Au. (1.4) 

The right hand side is u projected onto null(^)-'- where null(A) is the nullspace of A. 
When A is invertible, A~^Au is simply u. We call this "backward matrix probing" because 
the generated random vector u appears on the opposite side of the matrix being probed in 
Equation (1.4). The equation suggests the following framework for probing A~^ . 

Algorithm 1 (Backward matrix probing). Suppose A'^ = Y^^i=i^iBi- The goal is to retrieve 
the coefficients ci, . . . ,Cp. 

1. Generate u ~ A^(0, 1)" iid. 

2. Compute v = Au. 
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3. Filter away u's components in nuU{A). Call this u. 

4. Compute L by setting its j-column to Bjv. 

5. Solve for c the system Lc = u in a least squares sense. 

In order to perform the filtering in Step 3 efficiently, prior knowledge of A may be needed. 
For example, if A is the Laplacian with periodic boundary conditions, its nullspace is the set 
of constant functions and Step 3 amounts to subtracting the mean from u. A more involved 
example can be found in [9]. In this paper, we invert the wave equation Hessian, and Step 3 
entails building an illumination mask. Further comments on [9] are located in Section 4.5 of 
this paper. 

For the conditioning of L, we may apply Theorem 1.3 with B replaced with Ba ■= 
{BiA, . . . ,BpA} since the j-th column of L is now BjAu. Of course, k{Ba) and X{Ba) can 
be very different from k{B) and X{B); in fact, k{Ba) and X{Ba) seem much harder to control 
because it depends on A. Fortunately, as we shall see in Section 3.5, knowing the "order" of 
A~^ as a pseudodifferential operator helps in keeping these condition numbers small. 

When A has a high dimensional nullspace but has comparable nonzero singular values, 
X{Ba) may be much larger than is necessary. By a change of basis, we can obtain the following 
tighter result. 

Corollary 1.6. Let Ci,C2 > be numbers given by Remark B.l in the Appendix. Let 
A e C'"''", h = rank(A) and Ba = {BiA, BpA} where each Bj G C"^'". Define L G C"^^ 
such that its j-th column is BjAu where u ~ A^(0, 1)" iid. Let M = L*L, N = EM, k = k{Ba) 
and X = (n/n)^/^A(i3yi). Suppose 



Notice that fi = rank(j4) has taken the role of n, and our A is now maxi<j<p "||^_|^ | | — 
which ignores the n — h zero singular values of each BjA and can be much smaller than X{Ba)- 



2.1. Proof of Theorem 1.3. Our proof is decoupled into two components: one linear 
algebraic and one probabilistic. The plan is to collect all the results that are linear algebraic, 
deterministic in nature, then appeal to a probabilistic result developed in the Appendix. 

To facilitate the exposition, we use a different notation for this section. We use lower case 
letters as superscripts that run from 1 to p and Greek symbols as subscripts that run from 1 
to n or m. For example, the set of basis matrices is now B = {B^, . . . , 5^}. 

Our linear algebraic results concern the following variables. 



1. Let Ti^ = B^*B^ e C"^" and T^^ G C^^p such that the (j, A;)-th entry of T^^ is the 
(C, r?)-th entry of T^^. 



h > p(CKAlogn)^ for some C > 1. 



Then 




2. Proofs. 



3 

4, 
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The reason for introducing T is that M can be written as a quadratic form in T^^ with 
input u: 

M = ^ Ui:Ur,T^jj. 

Since has unit variance and zero mean, N = EM = Yl(=i 

Probabihstic inequalties apphed to M will involve T^^, which must be related to B. The 
connection between these n by n matrices and p hy p matrices lies in the identity 

m 

nv=EBaB'cv (2-1) 

The linear algebraic results are contained in the following propositions. 
Proposition 2.1. For any I < C,rj < n, 

Hence, T^^,N are all Hermitian. Moreover, they are positive semidefinite. 

Proof. Showing that T^^ = T*^ is straightforward from Equation (2.1). We now check that 

T^^ is positive semidefinite. Let v G C^. By Equation (2.1), v*T^^v = X]( Z^jfc ^''^'^-^Cg-^CC ~ 



v'^B^^ > 0. It follows that = T^^ is also positive semidefinite. 
^reposition 2.2. 

Q^f' = Tr{B^*SB^) andQ= ^ T/r^T^^. 



Proof. By Equation (2.1), Q^'' = ^i{T^^,T^'') = Y,iTt{B^* B^B^* B''). The summation 
and trace commute to give us the first identity. Similarly, the {j, A:)-th entry of T^fjT^^ is 
{T^\Tj^) = Y.iTi{B^*B^Bi*B^). Cycle the terms in the trace to obtain Qi^. ■ 
Proposition 2.3. Let u e be a unit vector. Define U = ^.1=1^^^^ ^ C™''". Then 



Proof. \\U\\'^p = Ti{U*U) = Tr(^j-^ u^u^B^* B^). The sum and trace commute and due to 
Equation (1.3), \\Ufp = Y^j^vJu^Ni'' < ||iv||. ■ 
Proposition 2.4. 

IIQII < 

Proof. Q is Hermitian, so \\Q\\ = ma.XuU*Qu where u £ has unit norm. Now let 
u be an arbitrary unit vector and define U = 'YTk=i'^^^^- Proposition 2.2, u*Qu = 
Y.jk^u^Q^'' = Tv{J2jk^i^u''B^*SB'') = Tt{U*SU). Since S is positive definite, it follows 

from ''\\AB\\p < \\A\\\\B\\p'' that u*Qu = \\S^/^U\\p < ||5|| By Proposition 2.3, 

u*Qu < \\S\\ \\N\\. U 
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Proposition 2.5. For any I < j <p, 



It follows that 



WQW 



< pX^ \\Nf /n. 



Proof. We begin by noting that ||A^|| > maxj lA^-'-'l = maxj {B\B^] 
Definition 1.2, \\B^ 



ll-B-'llp. From 

< An^^/^ < An^^/^ ||A^||"'^^^ for any 1 < j < p, wliich is our first 

inequality. It follows that \S\ < X]j=i ||-^"'||^ — P^^ 11-^11 Z'^- ^PP^Y Propositions 2.4 and 2.2 
to obtain the second inequality. ■ 

Proposition 2.6. F,G are Hermitian, and 



max I 



|F||,||G||)< A2||iV||(p/n). 



Proof That F,G are Hermitian follow from Proposition 2.1. Define F' = (T^^) another 
block matrix. Since reindexing the rows and columns of F does not change its norm, ||F|| = 

||F'||. By Proposition 2.5, ||F'f < Ej,fc=i \\T^''f < Ej,fc=i < -^^ ll^f {p/n)'^- 

The same argument works for G. ■ 

We now combine the above linear algebraic results with a probabilistic result in Appendix 
B. Prepare to apply Proposition B.6 with Aij replaced with T^^. Note that R = T^rjT^jj ~ 
Q by Proposition 2.2. Bound a using Propositions 2.5 and 2.6: 

a = Cim^^{\\Q\\'/\\\R\\'/\\\F\\,\\G\\) 

< Ci \\N\\ max((p/n)^/2A, ip/n)X'^) 

< Ci \\N\\ {p/nf/^\. 

The last step goes through because our assumption on n guarantees that (p/n)^/^A < 1. 
Finally, apply Proposition B.6 with t \\N\\ / n = eau. The proof is complete. 

2.2. Sketch of the proof for Theorem 1.4. Follow the proof of Proposition B.6. Letting 
s = log n, we obtain 

E||M-iV|| < (E||M- iVf)^/^ 

< Ci(2C2np)V«smax(||Qf /2 , \\Rf^ , \\F\\ , \\G\\) 

< C(logn) IIA^II {p/n)^/^X. 

2.3. Proof of Proposition 1.5. Recall that A is approximately the linear combination 
'^^=1 dP B\ while X]j=i B^ is the recovered linear combination. We shall first show that the 
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recovered coefficients c is close to d: 



\d — c\\ = \\L~^Au — c|| 

= \\L+{Lc + Eu) - c\\ 



< e \\u\ 



1/2 



.a-t)\\N\ 

Let V be a unit n-vector. Let L' he a, n x p matrix such that its j-th cohimn is B^v. Now, 



Av-Y^ c'B^v = {L'd + Ev) - L'c = Ev + L'{d - c). 



Combining the two equations, we have 



A 



< e + e\\L'\ 



a-t)\\N\ 



1/2 



(2.2) 



With overwhelming probability, \\u\\ = 0{^/n). The only term left that needs to be bounded 
is This turns out to be easy because ll-B-'ll < An~^/^ ||A^||^^^ by Proposition 2.5 and 



|i^'||^< X^II^^'^ll^^^^ll^ll^'/^- 



Substitute this into Equation (2.2) to finish the proof. 

2.4. Proof of Corollary 1.6. Let u ~ A^(0, 1)" iid. Say A has a singular value decompo- 
sition EAF* where A is diagonal. Do a change of basis by letting u' = F*u ~ -/V(0, 1)" iid, 
B'. = F*BjE and B'^ = {B[A, . . . ,5^A}. Equation (1.1) is reduced to L'c = Au' where the 
j-th. column of L' is B'-Au'. 

Since Frobenius inner products, ||-|| and \\-\\p are all preserved under unitary transforma- 
tions, it is clear that k{B'j^) = k{Ba) and A(i3^) = X{Ba)- Essentially, for our purpose here, 
we may pretend that A = A. 



Let h = rank(A). If A has a large nullspace, i.e., n <^ min(m,n), then B',A has n 



n 



columns of zeros and many components of u' are never transmitted to the B'-^s anyway. We 
may therefore truncate the length of u' to n, let Bj € C"^" be B'-A with its columns of zeros 
chopped away and apply Theorem 1.3 with B replaced with B := {Bi, . . . , Bp}. Observe that 
k{B) = n{B'^), whereas X{B) = (n/n)^/^A(5^) because 
but Bj has n instead of n columns. The proof is complete. 
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B'^A 


and 

F 






B'^A 



3. Probing operators with smooth symbols. 
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3.1. Basics and assumptions. We begin by defining what a pseudodifferential symbol is. 
Definition 3.1. Every linear operator A is associated with a pseudodifferential symbol a(x,^) 

such that for any u : R'^ — J- M, 

Au{x)= [ e^^'^-''a{x,i)u{i)di (3.1) 

where u is the Fourier transform of u, that is n(^) = f^^^au{x)e~'^'^^^'^dx. 

We refrain from calling A a "pseudodifferential operator" at this point because its symbol 
has to satisfy some additional constraints that will be covered in Section 3.5. What is worth 
noting here is the Schwartz kernel theorem which shows that every linear operator A : S{W^) — )• 
5'(]R'^) has a symbol representation as in Equation (3.1) and in that integral, a{x, ^) € 5'(M'^ x 
W^) acts as a distribution. Recall that S is the Schwartz space and 5' is its dual or the space of 
tempered distributions. The interested reader may refer to [12] or [25] for a deeper discourse. 

The term "pseudodifferential" arises from the fact that differential operators have very 
simple symbols. For example, the Laplacian has the symbol a(x,^) = — 47r^ ||'?||^- Another 
example is 

Au{x) = u{x) — V • a{x) gradu{x) for some a{x) G C^(IR'^). 

Its symbol is 

d 

a{x,0 = 1 + a(x)(47r2 ||e||') - J^(27ria)a,.,a(x). (3.2) 

k=l 

Clearly, if the media a{x) is smooth, so is the symbol a{x, ^) smooth in both x and ^, an 
important property which will be used in Section 3.3. 

For practical reasons, we make the following assumptions about u : ^ R on which 
symbols are applied. 

1. ti is periodic with period 1, so only ^ € Z*^ will be considered in the Fourier domain. 

2. u is bandlimited, say u is supported on H := [— Co) Co]'^ ^ Any summation over the 
Fourier domain is by default over H.^ 

3. a{x,£,) and u{x) are only evaluated at x G X C [0,1]^^ which are points uniformly 
spaced apart. Any summation over x is by default over X. 

Subsequently, Equation (3.1) reduces to a discrete and finite form: 

Au{x) = e2"*«-^a(x, OHO- (3-3) 

We like to call a(x,(,) a "discrete symbol." Some tools are already available for manipulating 
such symbols [10]. 

3.2. User friendly representations of symbols. Given a linear operator A, it is useful 
to relate its symbol a(x,.^) to its matrix representation in the Fourier basis. This helps us 
understand the symbol as a matrix and also exposes easy ways of computing the symbols of 
A~^,A* and AB using standard linear algebra software. 

■^To have an even number of points per dimension, one can use H — [—^0,^0 — 1]'' for example. We leave 
this generalization to the reader and continue to assume ^ G [—Co, Co]'*- 
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By a matrix representation (^r;^) in Fourier basis, we mean of course that Au{r]) = 
A^^u{^) for any rj. We also introduce a more compact form of the symbol: a(j, ^) = 

f^a{x,S,)e~'^'^^^'^dx. The next few results are pedagogical and listed for future reference. 

Proposition 3.2. Let A be a linear operator with symbol a{x,^). Let (^r;^) and a{j,^) be as 

defined above. Then 

Jx ^ 

Proof. Let ?7 = ^ + j and apply the definitions. ■ 

Proposition 3.3 (Trace). Let A be a linear operator with symbol a(x,^). Then 



TriA) = Y,d{0,O=Yl [ "(^'^)^^ 



Proposition 3.4 (Adjoint). Let A andC = A* be linear operators with symbols a{x , S^) , c{x , S,) . 
Then 

'n •'y 

Proposition 3.5 (Composition). Let A,B and C = AB be linear operators with symbols 
a{x,^),b{x,(,),c{x,^). Then 

c(j,e) = E^(-?' + ^-^'OKc-e,e); 

c 



(^'0 = E / e'"'^^-^^^"-'^«(^,C)6(y,0'iy- 



We leave it to the reader to verify the above results. 

3.3. Symbol expansions. The idea is that when a linear operator A has a smooth symbol 
a(x,^), only a few basis functions are needed to approximate a, and correspondingly only a 
small B is needed to represent A. This is not new, see for example [10]. In this paper, we 
consider the separable expansion 

jk 

This is the same as expanding A as ^jk^jkBjk where the symbol for Bjk is ej{x)gk{S,)- 
With an abuse of notation, let Bjk also denote its matrix representation in Fourier basis. 
Given our assumption that ^ S [— Coi'^o]'^) we have Bjk S C"^" where n = (2^o + l)*^- As its 
symbol is separable, Bjk can be factorized as 

Bjk = -Fdiag(e,(x))J-^i diag(5fe(e)) (3.4) 



12 



where T is the unitary Fourier matrix. An alternative way of viewing is that it takes 
its input u(^), multiply by guii^ and convolve it with ej{ri), the Fourier transform of ej{x). 
There is also an obvious algorithm to apply Bji^ to u{x) in 0{n) time as outlined below. As 
mentioned in Section 1.4, this speeds up the recovery of the coefficients c and makes matrix 
probing a cheap operation. 

Algorithm 2. Given vector u{x), apply the symbol ej{x)gk{C}- 

1. Perform FFT on u to obtain u{^). 

2. Multiply u[^) by gk{C) elementwise. 

3. Perform IFFT on the previous result, obtaining '^^e^^^^'^gk{S,)u{£,)- 
4- Multiply the previous result by ej{x) elementwise. 

Recall that for L to be well-conditioned with high probability, we need to check whether N, 
as defined in Equation (1.3), is well-conditioned, or in a rough sense whether {Bj^Bk) ~ 6jk. 
For separable symbols, this inner product is easy to compute. 

Proposition 3.6. Let Bj^^Bjij^i G c^x" matrix representations (in Fourier basis) of linear 
operators with symbols ej{x)gk{Cj ^.f^d eji{x)gk>{C}- Then 

{Bjk,Bj'k') = {ej,ej/) {gk,gk') 

where {ej,eji) = ^Y17=i^ji^i)^j'i^i) ^''^^ xi,...,Xn are points in [0,1]^ uniformly spaced, 

and {gk,gk') = T,^gk{09k{0- 

Proof. Apply Propositions 3.3, 3.4 and 3.5 with the symbols in the a{rj,^,) form. ■ 

To compute X{B) as in Definition 1.2, we examine the spectrum of Bjk for every j, k. A 

simple and relevant result is as follows. 

Proposition 3.7. Assume the same set-up as in Proposition 3.6. Then 

crminiBjk) > min|ej(x)| min|5tfc(^)|; CF^s.^{Bjk) < max 1 6^(2;) | max 1 5^.(^)1. 

Proof. In Equation (3.4), J^diag(e-'(x))J-"~^ has singular values |ej(x)| as x varies over X, 
defined at the end of Section 3.1. The result follows from the min-max theorem. ■ 

As an example, suppose a(x,^) is smooth and periodic in both x and ^. It is well-known 
that a Fourier series is good expansion scheme because the smoother a(a;,^) is as a periodic 
function in x, the faster its Fourier coefficients decay, and less is lost when we truncate the 
Fourier series. Hence, we pick^ 

e,(x) = e2-^'-^; ^,(0 = e^-'^'^^?), (3.5) 

where ip{0 = (C + Co)/{'^^o + 1) maps ^ into [0, l]'^. 

Due to Proposition 3.6, = EM is a multiple of the identity matrix and k.{B) = 1 where 
13 = {Bjk}. It is also immediate from Proposition 3.7 that \{Bjk) = 1 for every j, /c, and 
\{B) = 1. The optimal condition numbers of this B make it suitable for matrix probing. 



Actually, exp(27rifc^o/(25o + 1)) does not vary with ^, and we can use = '^/(2^o + !)• 
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3.4. Chebyshev expansion of symbols. The symbols of differential operators are poly- 
nomials in and nonperiodic. When probing these operators, a Chebyshev expansion in ^ is 
in principle favored over a Fourier expansion, which may suffer from the Gibbs phenomenon. 
However, as we shall see, n{B) grows with p and can lead to ill-conditioning. 

For simplicity, assume that the symbol is periodic in x and that ej{x) = e^'^^y^ , Applying 
Proposition 3.2, we see that Bjk is a matrix with a displaced diagonal and its singular values 
ai'e {gk{(,))s,eE- (Recall that we denote the matrix representation (in Fourier basis) of Bj^ as 
Bjf^ as well.) 

Let Tfc be the fc-th Chebyshev polynomial. In ID, we can pick 

gk(.0 = Tk(.a^o)ioTk = l,...,K. (3.6) 
Define ||Tfc||2 = {J^^__^Tk{z)^dz)^/'^ . By approximating sums with integrals, X{Bjk) ~ 

1 /2 

2k'^-i ) • Notice that there is no (1 — z^) ^^"^ weight factor in the definition 
of llTfcllg because ej(x)Tfc(^) is treated as a pseudodifferential symbol and has to be evaluated 
on the uniform grid. In practice, this approximation becomes very accurate with larger n and 
we see no need to be rigorous here. As k increases, X{Bji^) approaches \/2. More importantly, 
KBjk) < X{Bji) for any j,k, so 

X{B) = Vs. 

Applying the same technique to approximate the sum {gk,9k')j we find that {gk^dk') oc 
{1 — {k + fc')^)~"^ -|- (1 — (fc — k')'^)~^ when k + k' is even, and zero otherwise. We then compute 
N = EM for various K and plot k,{B) versus K, the number of Chebyshev polynomials. As 
shown in Figure 3.1(a), 

k{B) ~ 1.3i^. 

This means that if we expect to recover p = 0{n) coefficients, we must keep K fixed. 
Otherwise, if p = ^ only p = 0(n^/^) are guaranteed to be recovered by Theorem 1.3. 
In 2D, a plausible expansion is 

Qkii) = e^'^^-^^^Tk.Mm) for l<k2<K (3.7) 

where k = [ki, k2) and ^{r) = {■\/2r / — 1 maps ||^|| into [—1, 1]. We call this the "Chebyshev 
on a disk" expansion. 

/ 1 1 \-i/2 
The quantity X{Bjk) is approximately 2 ( fr^^_i Iy=-i Tkitpi^, y))'^dx dyj where ij{x, y) = 

{2x'^ + 2y^)^/^ — 1. The integral is evaluated numerically and appears to converge 6 to V2 for 
large k2- Also, /c2 = 1 again produces the worst X{Bjk) and 

X{B) < 2.43.7 

^This is because when we truncate the disk of radius $,oV2 to a square of length 2^0, most is lost along the 
vertical axis and away from the diagonals. However, for large k, Tk oscillates very much and the truncation 
does not matter. If we pretend that the square is a disk, then we are back in the ID case where the answer 
approaches y/2 for large k. 

■'The exact value is 2(4 - |^sinh"^(l))"^''^ 



14 




Figure 3.1. Let K he the number of Chebyshev polynomials used in the expansion of the symbol, see 
Equation (3.6) and (3.7). Observe that in ID, k,{B) = O(A') while in 2D, k{B) = 0{K^). These condition 
numbers mean that we cannot expect to retrieve p = 0{n) parameters unless K is fixed and independent ofp,n. 



As for ii{B), observe that when ki / k'-^, \ 9kik2^ 9k[k'.^j = due to symmetry'^, whereas 
when ki = A;^, the inner product is proportional to n and is much larger. As a result, the g^s 
with different kis hardly interact and in studying n{B)^ one may assume that ki = k'^ = 0. 
To improve k{B), we can normalize gk such that the diagonal entries of are all ones, that 

is 9i{0=9k{0/\\9k{m- 

This yields another set of basis matrices B'. Figure 3.1(b) reveals that 

k{B) = 0{K^) and k{B') ~ k{B). 

The latter can be explained as follows: we saw earlier that {Bj^, Bj^) converges as ^2 
increases, so the diagonal entries of A^ are about the same and the normalization is only a 
minor correction. 

If a{x,S^) is expanded using the same number of basis functions in each direction of x 
and ^, i.e., K = p^^^, then Theorem 1.3 suggests that only p = Oip?^^) coefficients can be 
recovered. 

To recap, for both ID and 2D, \{B) is a small number but k,{B) increases with K. For- 
tunately, if we know that the operator being probed is a second order derivative for example, 
we can fix AT = 2. 

Numerically, we have observed that the Chebyshev expansion can produce dramatically 
better results than the Fourier expansion of the symbol. More details can be found in Section 
4.3. 



The ^ and — ^ terms cancel each other. Only ^ = contributes to the sum. 
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3.5. Order of an operator. In standard texts, A is said to be a pseudodifferential operator 
of order w if its symbol a{x,S^) is in C°°(M'^ x M'^) and for any multi-indices a, (3, there exists 
a constant Ca/s such that 

|9f afa(x,OI < C^p (0"-'"' for all e, where (0 = 1 + 1101 • 

Letting a = /3 = 0, we see that such operators have symbols that grow or decay as 
(1 + IIOI)""- As an example, the Laplacian is of order 2. The factor 1 prevents (0 from 
blowing up when ^ = 0. There is nothing special about it and if we take extra care when 
evaluating the symbol at ^ = 0, we can use 

(0 = 1101- 

For forward matrix probing, if it is known a priori that a(x,0 behaves like (0"'; it makes 
sense to expand a(x,0 instead. Another way of viewing this is that the symbol of the 

operator Bji^ is modified from ej{x)gk{S^) to ej{x)gk{^) (O"' to suit A better. 

For backward matrix probing, if A is of order z, then A~^ is of order —z and we should 
replace the symbol of Bj^ with ej{x)gk[^) (O"""- We believe that this small correction has an 
impact on the accuracy of matrix probing, as well as the condition numbers h{Ba) and \{Ba)- 

Recall that an element of Ba is Bj^A. If A is of order w and Bjj. is of order 0, then Bj^A 
is of order w and X(BjkA) will grow with n'^, which will adversely affect the conditioning of 
matrix probing. However, by multiplying the symbol of Bj^ by , we can expect Bj^A 

to be order and that X{BjkA) is independent of the size of the problem n. The argument is 
heuristical but we will support it with some numerical evidence in Section 4.3. 

4. Numerical examples. We carry out four different experiments. The first experiment 
suggests that Theorem 1.4 is not tight. The second experiment presents the output of back- 
ward probing in a visual way. In the third experiment, we explore the limitations of backward 
probing and also tests the Chebyshev expansion of symbols. The last experiment involves the 
forward probing of the foveation operator, which is related to human vision. 

4.1. ID statistical study. We are interested in whether the probability bound in Theorem 
1.3 is tight with respect to p and n, but as the tail probabilities are small and hard to estimate, 
we opt to study the first moment instead. In particular, if Theorem 1.4 captures exactly the 
dependence ofE||M — A^||/||A^|| onp and n, then we would need n to grow faster than plog^ n 
for E||M — A^||/||A^|| to vanish, assuming X{B) is fixed. 

For simplicity, we use the Fourier expansion of the symbol in ID so that X{B) = i^{B) = 1. 
Let J be the number of basis functions in both x and ^ and p = J^. Figure 4.1(a) suggests that 
E||M — A^||/||A^|| decays to zero when n = plog'^p and c > 1. It follows from the previous 
paragraph that Theorem 1.4 cannot be tight. 

Nevertheless, Theorem 1.4 is optimal in the following sense. Imagine a more general bound 

\\M - N\\ 

^^jiVjP^ < (log" n) l^^j for some a, /3 > 0. (4.1) 

In Figure 4.2(a), we see that for various values p/n, a = 1 since the graphs are linear. 
On the other hand, if we fix p and vary n, the log-log graph of Figure 4.2(b) shows that 
/? = 1/2. Therefore, any bound in the form of Equation (4.1) is no better than Theorem 1.4. 
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Figure 4.1. Consider the Fourier expansion of the symbol. J is the number of basis functions in x and 
^, so p = . Let n = plog'^p. Figure (a) shows that the estimated E ||M — A^|| / || A'"|| decays for c > 1.1 
which suggests that Theorem 1.4 is not tight. In Figure (b), we estimate P(||M — A^|| / ||A'^|| > t) by sampling 
||M - iV|| / |liV|| 10^ times. The tail probability appears to be subgaussian for small t and subexponential for 
larger t. 



Next, we fix p = 25, n = 51 and sample \\M — N\\ / \\N\\ many times to estimate the 
tail probabilities. In Figm'e 4.1(b), we see that the tail probability ofP(||M-iV||/||iV|| > t) 
decays as exp(— cit) when t is big, and as exp(— C2t^) when t is small, for some positive numbers 
ci,C2- This behavior may be explained by Rauhut and Tropp's yet published result. 

4.2. Elliptic equation in ID. We find it instructive to consider a ID example of matrix 
probing because it is easy to visualize the symbol a{x,(^). Consider the operator 

d dtii xi 

Au{x) = -— a(x) — ^ where a{x) = 1 + 0.4cos(47rx) + 0.2 cos(67rx). (4.2) 

ax dx 

Note that we use periodic boundaries and A is positive semidefinite with a one dimensional 
nullspace consisting of constant functions. 

We probe for A'^ using Algorithm 1 and the Fourier expansion of its symbol or Equation 
(3.5). Since A is of order 2, we premultiply gkiO by explained in Section 3.5. 

In the experiment, n = 201 and there are two other parameters J, K which are the number 
of e/s and g^s used in Equation (3.5). To be clear, —^^-^ < i < ^^-j^ and —^^-^ < k < -^^j-^- 

Let C be the output of matrix probing. In Figure 4.3(b), we see that J = K = 5 is not 
enough to represent A~^ properly. This is expected because our media a{x) has a bandwidth 
of 7. We expect J = = 13 to do better, but the much larger p leads to overfitting and a 
poor result, as is evident from the wobbles in the symbol of C in Figure 4.3(c). Probing with 
four random vectors, we obtain a much better result as shown in Figure 4.3(d). 
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Figure 4.2. Consider bounding E \\M — N\\ / \\N\\ by (log" n){p/n)^ . There is little loss in replacing logn 
with logp in the simulation. In Figure (a), the estimated E ||A/ — 7V|| / ||A''|| depends linearly on logp, so a > 1. 
In Figure (b), we fix p and find that for large n, P — 1/2. The conclusion is that the bound m Theorem 1.4 has 
the best a, p. 



4.3. Elliptic Equation in 2D. In this section, we extend the previous set-up to 2D and 
address a different set of questions. Consider the operator A defined as 

Au{x) = —V • a{x)Vu{x) where a{x) = — + cos^(7r7Xi) sin^(7r7X2). (4-3) 

The positive value T is called the contrast while the positive integer 7 is the roughness 
of the media, since the bandwidth of a{x) is 27 + 1. Again, we assume periodic boundary 
conditions such that ^'s nullspace is exactly the set of constant functions. 

Let C be the output of the backward probing of A. As we shall see, the quality of C drops 
as we increase the contrast T or the roughness 7. 

Fix n = 101^ and expand the symbol using Equation (3.5). Let J = be the number of 
basis functions used to expand the symbol in each of its four dimensions, that is p = J^. 

In Figure 4.4(b), we see that between J = 27 — 1 and J = 27 + 1, the bandwidth of 
the media, there is a marked improvement in the preconditioner, as measured by the ratio 
cond(CA)/cond(^).9 

On the other hand. Figure 4.4(a) shows that as the contrast increases, the preconditioner 
C degrades in performance, but the improvement between J = 27 — 1 and 27 + 1 becomes 
more pronounced. 

The error bars in Figure 4.4 are not error margins but a where is the unbiased estimator 
of the variance. They indicate that cond(CA)/cond(A) is tightly concentrated around its 



^ Since A has one zero singular value, cond(j4) actually refers to the ratio between its largest singular value 
and its second smallest singular value. The same applies to CA. 
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Figure 4.3. Let A be the ID elliptic operator in Equation (4-2) and be its pseudoinverse. Let C be the 
output of backward matrix probing with the following parameters: q is the number of random vectors applied 
to A'^ ; J,K are the number ofcj's and g^'s used to expand the symbol of A'^ in Equation (3.5). Figure (a) 
is the symbol of A'^ . Figure (b) is the symbol of C with ,J = K = 5. It lacks the sharp features of Figure (a) 
because B is too small to represent A'^ well. With J — K = 13, probing with only one random vector leads 
to ill- conditioning and an inaccurate result in Figure (b). In Figure (c), four random vectors are used and a 
much better result is obtained. Note that the symbols are multipled by (^)^ for better visual contrast. 



mean, provided J is not too much larger than is necessary. For instance, for 7 = 1, J = 3 
already works well but pushing to J = 9 leads to greater uncertainty. 

Next, we consider forward probing of A using the "Chebyshev on a disk" expansion or 
Equation (3.7). Let m be the order correction, that is we multiply gk{0 by (0™ = IICIP- Let 
C be the output of the probing and K be the number of Chebyshev polynomials used. 

Fix n = 55^, T = 10, 7 = 2 and J = 5. For in = and K = 3, i.e., no order correction 
and using up to quadratic polynomials in we obtain a relative error ||C — j4||/||A|| that is 
less than 10"^^. On the other hand, using Fourier expansion, with i^' = 5 in the sense that 
— ^^^ < ki,k2 < ' relative error is on the order of 10~^. The point is that in this 
case, A has an exact "Chebyshev on a disk" representation and probing using the correct B 
enables us to retrieve the coefficients with negligible errors. 

Finally, we consider backward probing with the Chebyshev expansion. We use J = 5, 
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Figure 4.4. Let A be the operator defined in Equation (4-3) and C be the output of backward probing. In 
Figure (b), we fix T = 10'* and find that as J goes from 27 — 1 to 27 + 1, the bandwidth of the media, the quality 
of the preconditioner C improves by a factor between 10°'^ and 10. In Figure (a), we fix 'y — 2 and find that 
increasing the contrast worsens cond{C A) / cond{A) . Nevertheless, the improvement between J — 3 and J = 5 
becomes more distinct. The error bars correspond to a where is the estimated variance. They indicate that 
C is not just good on average, but good with high probability. 



7 = 2 and T = 10. Figure 4.5 shows that when m = —2, the condition numbers \{Ba) and 
ti{BA) are minimized and hardly increases with n. This emphasizes the importance of knowing 
the order of the operator being probed. 

4.4. Foveation. In this section, we forward-probe for the foveation operator, a space- 
variant imaging operator [7], which is particularly interesting as a model for human vision. 
Formally, we may treat the foveation operator A as a Gaussian blur with a width or standard 
deviation that varies over space, that is 

f ^ ( — 11*^ — 

Au{x) = / K{x,y)u{y)dy where K{x,y) = exp , (4.4) 

7r2 w{x)V27r \ 2w^{x) J 

where w{x) is the width function which returns only positive real numbers. 

The resolution of the output image is highest at the point where w{x) is minimal. Call 
this point xq. It is the point of fixation, corresponding to the center of the fovea. For our 
experiment, the width function takes the form of w{x) = {a \\x — xq\\^ + PY^'^ ■ Our images 
are 201 x 201 and treated as functions on the unit square. We choose xq = (0.5, 0.5) and 
a,/3 > such that w{xq) = 0.003 and w{l, 1) = 0.012. 

The symbol of A is a(x, ^) = exp(— 27r^iD(a;)^ IICII^)) and we choose to use a Fourier series or 
Equation (3.5) for expanding it. Let C be the output of matrix probing and z be a standard 
test image. Figure 4.6(c) shows that the relative f' error ||C2; — A2;||^2 / ||^^;||£2 decreases 
exponentially as p increases. In general, forward probing yields great results like this because 
we know its symbol well and can choose an appropriate B. 
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Figure 4.5. Consider the backward probing of A m Equation (4-3), a pseudodifferential oeprator of order 2. 
Perform order correction by multiplying gk{£.) by in the expansion of the symbol. See Section 3.5. Observe 
that at q — —2, the condition numbers X{Ba) and k{Ba) are minimized and hardly grow with n. 



4.5. Inverting the wave equation IHessian. In seismology, it is common to recover the 
model parameters m, which describe the subsurface, by minimizing the least squares misfit 
between the observed data and F{m) where F, the forward model, predicts data from m. 

Methods to solve this problem can be broadly categorized into two classes: steepest descent 
or Newton's method. The former takes more iterations to converge but each iteration is 
computationally cheaper. The latter requires the inversion of the Hessian of the objective 
function, but achieves quadratic convergence near the optimal point. 

In another paper, we use matrix probing to precondition the inversion of the Hessian. 
Removing the nullspace component from the noise vector is more tricky (see Algorithm 1) 
and involves checking whether "a curvelet is visible to any receiver" via raytracing. For 
details on this more elaborate application, please refer to [9]. 

5. Conclusion and future work. When a matrix A with n columns belongs to a specified 
p-dimensional subspace, say A = Yl\=i CiBi, we can probe it with a few random vectors to 
recover the coefficient vector c. 

Let q be the number of random vectors used, k be the condition number of the Gram 
matrix oi Bi, . . . , Bp and A be the "weak condition number" of each Bi (cf. Definition 1.2) 
which is related to the numerical rank. From Theorem 1.3 and Section 1.3, we learn that when 
nq oc p(KAlogn)^, then the linear system that has to be solved to recover c (cf. Equation 
(1.1)) will be well-conditioned with high probability. Consequently, the reconstruction error 
is small by Proposition 1.5. 

The same technique can be used to compute an approximate A~^, or a preconditioner for 
inverting A. In [9], we used it to invert the wave equation Hessian — here we demonstrate 
that it can also be used to invert elliptic operators in smooth media (cf. Sections 4.2 and 4.3). 

Some possible future work include the following. 
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Figure 4.6. Let A he the foveation operator in Equation (4-4) ^nrf C be the output of the forward probing 
of A. Figure (a) is the test image z. Figure (b) is Cz and it shows that C behaves like the foveation operator 
as expected. Figure (c) shows that the relative error (see text) decreases exponentially with the number of 
parameters p = J . 

1. Extend the work of Pfander, Rauhut et. al. [21, 20, 22]. These papers are concerned 
with sparse signal recovery. They consider the special case where B contains matri- 
ces each representing a time- frequency shift, but A is an unknown linear combination of 
only p of them. The task is to identify these p matrices and the associated coefficients 
by applying A to noise vectors. Our proofs may be used to establish similar recovery 
results for a more general B. However, note that in [20], Pfander and Rauhut show 
that n oc plogn suffices, whereas our main result requires an additional log factor. 

2. Build a framework for probing f{A) interpreted as a Cauchy integral 

f{A) = ^^£f{z){zI-Ar'dz, 

where F is a closed curve enclosing the eigenvalues of A. For more on approximating 
matrix functions, see [13, 15]. 

3. Consider expansion schemes for symbols that highly oscillate or have singularities that 
are well-understood. 

Appendix A. Linear algebra. 
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Recall the definitions of k{B) and X{B) at the beginning of the paper. The following 
concerns probing with multiple vectors (cf. Section 1.3). 

Proposition A.l. Let Ig € C""? be the identity. Let B = {Bi, . . .,Bp}. Let B'- = Iq® Bj 
andB' = {B[,...,B'p}. Then k{B) = k{B') and X{B) = X{B'). 

Proof. Define N G C^^f such that Njk = {Bj,Bk). Define N' G C^^p such that N'- 

B'-,B'i^). Clearly, A^' = qN, so their condition numbers are the same and k{B) = k{B'). 



jk 



For any A = Bj e C™^" and A' = B'-, we have 
A(^) = \{B'). ■ 



\\A'\\{nq) 



1/2 



m{nq) 



1/2 



l|A||^gV2 



||A||n 



1/2 



\\A\\, 



Hence, 



Appendix B. Probabilistic tools. In this section, we present some probabilistic results 
used in our proofs. The first theorem is used to decouple homogeneous Rademacher chaos of 
order 2 and can be found in [8, 23] for example. 

Theorem B.l. Let {ui) and (ui) be two iid sequences of real-valued random variables and 
Aij be in a Banach space where 1 < i, j < n. There exists universal constants Ci, C2 > such 
that for any s > 1, 



E 



s\ 1/s 



< CiC. 



1/s 



E 



uiUjAij 

l<i,j<n 



s\ 1/s 



(B.l) 



A homogeneous Gaussian chaos is one that involves only products of Hermite polynomials 
with the same total degree. For instance, a homogeneous Gaussian chaos of order 2 takes the 
form X^i<j-^j<„ Oidj^ij + Y17=ii9i ~ I* can be decoupled according to Arcones and Gine 

[2]. 

Theorem B.2. Let (ui) and (uj) be two iid Gaussian sequences and Aij be in a Banach 
space where 1 < i, j < n. There exists universal constants Ci, C2 > such that for any s > 1, 
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s\ 1/s 



< CiC. 
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s\ 1/s 



Remark B.l. For Rademacher chaos, Ci = 4 and C2 = 1. For Gaussian chaos, we can 
integrate Equation (2.6) of [2] (with m = 2) to obtain Ci = 2^/^ and C2 = 2^'^. Better 
constants may be available. 

We now proceed to the Khintchine inequalties. Let \\-\\q^ denote the s-Schatten norm. 
Recall that \\A\\c^ = {Y^iWil'^^' where fjj is a singular value of A. The following is due to 
Lust-Piquard and Pisier [18, 19]. 

Theorem B.3. Let s > 2 and (ui) be a Rademacher or Gaussian sequence. Then for any 
set of matrices {^i}i<i<nj 
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The factor s^/^ above is not optimal. See for example [3] by Buchholz, or [24, 26]. 

In [23], Theorem B.3 is applied twice in a clever way to obtain a Khintchine inequality for 
a decoupled chaos of order 2. 

Theorem B.4. Let s > 2 and (ui) and (ui) be two independent Rademacher or Gaussian 
sequences. For any set of matrices {^ij}i<i,j<n, 



E 



1/s 



< 2^/^*5 max( 












a' 





A\F\\c^,\\G\\cJ 



Cs 



where Q = I]i<ij<„ ^Ij^ij o.nd R = I]i<ij<„ ^ij^ij o^^d F, G are the block matrices {Aij)i<i,j<n, 
{^ij)i<i,j<n respectively. 

For Rademacher and Gaussian chaos, higher moments are controlled by lower moments, 
a property known as "hypercontractivity" [2, 8]. This leads to exponential tail bounds by 
Markov's inequality as we illustrate below. The same result appears as Proposition 6.5 of 
[24]. 

Proposition B.5. Let X be a nonnegative random variable. Let a,c,a > 0. Suppose 
(EX*)^/'* < crc^/^s^/" for all sq < s < oo. Then for any k > and u > s]^"" , 



X > e'^auj < cexp(-A:u"). 
Proof. By Markov's inequality, for any s > 0, P (X > e'^au) < 



< c 



5I/Q 



Pick 



s = > So to complete the proof. ■ 

Proposition B.6. Let (ui) be a Rademacher or Gaussian sequence and Ci,C2 be constants 
obtained from Theorem B.l or B.2. Let {^jj}i<j,j<n be a set of p by p matrices, and assume 
that the diagonal entries An are positive semidefinite. Define M = Yli'^i'^j^ij ^ ~ 
Ci max(||Q||^/2 , , ||F|| , ||G||) where Q,R,F, G are as defined in Theorem B.J^- Then 



\M - EM|| > eau) < {2G2np) exp(-n) 



Proof. We will prove the Gaussian case first. Recall that the s-Schatten and spectral 
norms are equivalent: for any A € C'"^'', ||^|| < ||^||(^^ < r'^^^ \\A\\. Apply the decoupling 
inequality, that is Theorem B.2, and deduce that for any s > 2, 



(E||M-iV 



1/. 



E 



l<i,j<n 



1/s 



Cs 



Invoke Khintchine's inequality, that is Theorem B.4, and obtain 



(E||M- A^ll 



A/s 



< Ci(2C2)^/"smax 



Q 



1/2 



Cs 



Cs 



Cs 



< Ci(2C2np)i/^smax(||Q||i/2 , \\R\\^'^ , ||F|| , ||G||) 

< (T{2G2npf/'s. 
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Apply Proposition B.5 with c = 2C2np and A; = a = 1 to complete the proof for the Gaus- 
sian case. For the Rademacher case, we take similar steps. First, decouple (E \\M — N\\^Y^^ 
using Theorem B.l. This leaves us a sum that excludes the AiiS. Apply Khintchine's inequal- 
ity with the AiiS zeroed. Of course, Q, R, F, G in Proposition B.4 will not contain any ^m's, 
but this does not matter because A*- An and AnA*- and An are all positive semidefinite for 
any I < i < n and we can add them back. For example, ||(Ajj)i<j^j<„|| < \\{Aij) 
block matrices. ■ 

An alternative way to prove the Gaussian case of Proposition B.6 is to split (E \\M — NW'^y/' 
into two terms (E — l)74jj||)^/* and (E UjUjjdjj ||)-^/^. For the second term, we can 

insert Rademacher variables, condition on the Gaussians, decouple the Rademacher sum and 
apply Theorem B.4. After that, we can pull out the maximum of all the Gaussians from 
Q, R, F, G. Nevertheless, as this may introduce extra log n factors, we prefer to simply appeal 
to [2] to decouple the Gaussian sum right away. 
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